from IPython import display ;display.Image("https://images.ctfassets.net/vz6nkkbc6q75/3avkbm7rsk9oOWAEPBy9YY/4d8d64a8613352398af4e651edd4b001/NYC-Blog-Header-837x335-1__1_.jpg",width = 900, height = 50)
The report is based on the 2018 trip data from the Citi Bike website, comprising 17,548,339 trips. First, volumes are examined in light of calendar and weather features. Next, days and bike stations are clustered, and expansion opportunities are explored in candidate neighborhoods. Finally, prediction problems are illustrated with results and discussion. To begin, calendar features are derived from the start time stamps after import.
#<!--eofm-->
%run User_defined_Functions_EDA_RQ.ipynb
bike = import_data('Data/Trips_2018.csv'); bike = calculated_fields(bike)  # forward slash avoids the invalid '\T' escape and works across platforms
bike[['start_time_stamp','dayofmonth', 'start_month', 'week_number','start_hour', 'start_monthname', 'start_dayofweek_name',
'start_dayofweek', 'weekend', 'tripduration_min']].head(n=2)
%run User_defined_Functions_EDA_RQ.ipynb ;
DailyPlots(bike,25,35) ; print("Time lapse on a weekday");weekdays_and_weekends_trip_duration(bike);bike_heat_map(bike)
A weekly pattern in volumes can be observed (plot 1). In the next two plots, weekday volumes peak around 7am-9am and 5pm-7pm, indicating rush hours, while weekend volumes peak between 1pm and 5pm, indicating leisure outings. Hour of day and day of week therefore appear to be important factors in determining hourly volumes. Further, longer trip durations were observed in the middle of the year, with a gap of ~50% in trip duration levels between January and June.
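The weekday and weekend peaks described above can be located with a simple group-by on calendar fields derived from the start time stamp. The following is a minimal sketch on six made-up trips, not the project's `calculated_fields` helper; the `day_type` column is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Six made-up trips (2018-06-04 is a Monday, 2018-06-09/10 a weekend).
trips = pd.DataFrame({
    "start_time_stamp": pd.to_datetime([
        "2018-06-04 08:15", "2018-06-04 08:40", "2018-06-04 17:55",
        "2018-06-09 14:05", "2018-06-09 14:30", "2018-06-10 16:20",
    ]),
})

# Calendar features derived from the time stamp.
trips["start_hour"] = trips["start_time_stamp"].dt.hour
trips["start_dayofweek"] = trips["start_time_stamp"].dt.dayofweek  # Monday = 0
trips["weekend"] = trips["start_dayofweek"] >= 5
# Hypothetical helper column: a readable label for the weekend flag.
trips["day_type"] = trips["weekend"].map({True: "weekend", False: "weekday"})

# Trips per hour, one column per day type.
hourly = (
    trips.groupby(["day_type", "start_hour"])
    .size()
    .unstack("day_type", fill_value=0)
)

# Hour with the highest trip count for each day type.
peak_hour = hourly.idxmax()
print(peak_hour)
```

On the full data set the same table, plotted per hour, would give the weekday and weekend profiles discussed above.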
Daily weather data for 2018 was sourced and combined with the monthly pickups data to look for correlations between daily volumes and weather elements.
%run User_defined_Functions_EDA_RQ.ipynb
bike_weather = import_weather(); weather_plots(bike_weather) # contains data at trip level after joining the weather elements(which are at daily level)
Temperature has a high +ve correlation with volumes, making summers conducive to biking, whereas snowfall and snow depth are -ve correlated. Wind has a -ve but weak correlation, indicating that wind speeds are generally low. Rain (PRCP) has a +ve correlation, which is counter-intuitive, albeit low. An hourly correlation check may have revealed a +ve relation. 5-7 day weather forecasts therefore appear to be good candidates as predictors of pickup volumes (see studies).
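A correlation check of this kind can be sketched with pandas as below. The data is synthetic and only mimics the direction of the effects described above; the column names `TAVG`, `SNOW` and `AWND` are assumed weather-station style names (only `PRCP` appears in this report), and `weather_correlations` is a hypothetical helper, not the project's `weather_plots`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data (hypothetical; not the 2018 Citi Bike figures).
rng = np.random.default_rng(0)
days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
doy = np.arange(len(days))

# Temperature follows a seasonal curve; snow only falls on cold days.
tavg = 55 - 25 * np.cos(2 * np.pi * (doy - 15) / 365) + rng.normal(0, 4, len(days))
snow = np.where(tavg < 35, rng.exponential(1.0, len(days)), 0.0)

daily = pd.DataFrame(
    {"TAVG": tavg, "SNOW": snow, "AWND": rng.normal(10, 3, len(days))},
    index=days,
)
# Assumed relationship for the illustration: warmer days add trips, snow removes them.
daily["trips"] = 800 * daily["TAVG"] - 3000 * daily["SNOW"] + rng.normal(0, 5000, len(days))


def weather_correlations(df, target="trips"):
    """Pearson correlation of every other column with the target column."""
    return df.corr(method="pearson")[target].drop(target).sort_values(ascending=False)


corr = weather_correlations(daily)
print(corr)
```

Pearson correlation only captures linear association, so a weak coefficient (as reported for wind) does not rule out a threshold effect on very windy days.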
Building on the effect of weather conditions on bike usage, an attempt was made to cluster the days of the year. Attributes that could affect bike traffic, such as weekday/weekend or event (holiday or weather-related), were added alongside hourly volumes.
%run User_defined_Functions_EDA_RQ.ipynb
bike_volume= group_by_hour(bike); bike_volume = insert_events(bike_volume); dt = create_df_days(bike_volume)
%store dt
dt_reduced = estimate_PCA_model_short(dt, n_components=2) #this is the dataframe with the 2 dimensions obtained from PCA
#gaussian mixture model clustering is performed and the clusters are assigned to a new dataframe
dt_reduced_2 = gaussian_mixture_clustering(dt_reduced[[0,1]], n_components=4);add_more_event_days(dt_reduced)
Although the silhouette score obtained is lower than the one obtained with the K-means algorithm, the Gaussian mixture model partitions the data in a way that aligns better with expectations. The plot below shows a partition according to whether the day belongs to the class 'weekend or special event' or the class 'workday'. The definition of a 'special event' is given in Appendix I - Temporal Clustering.
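The comparison described above (PCA to 2 dimensions, then a 4-component Gaussian mixture scored against K-means by silhouette) could look roughly like the following scikit-learn sketch. It runs on synthetic day profiles and is not the project's `estimate_PCA_model_short` or `gaussian_mixture_clustering`; the `bump` helper and the profile shapes are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
hours = np.arange(24)


def bump(center, width):
    """Bell-shaped hourly demand curve (hypothetical helper)."""
    return np.exp(-0.5 * ((hours - center) / width) ** 2)


# Synthetic hourly volume profiles: 250 workdays with two rush-hour peaks,
# 115 weekend-like days with a single afternoon peak.
workday = 1000 * (bump(8, 1.5) + bump(18, 1.5))
weekend = 700 * bump(15, 3.0)
profiles = np.vstack([
    workday + rng.normal(0, 50, (250, 24)),
    weekend + rng.normal(0, 50, (115, 24)),
])

# Reduce the 24 hourly features to 2 principal components.
reduced = PCA(n_components=2, random_state=0).fit_transform(profiles)

# Cluster the reduced data with both methods, 4 clusters each.
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(reduced)
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)

# Silhouette score: higher means tighter, better separated clusters.
scores = {
    "gmm": silhouette_score(reduced, gmm_labels),
    "kmeans": silhouette_score(reduced, km_labels),
}
print(scores)
```

The silhouette score rewards compact, roughly spherical clusters, which tends to favor K-means; a mixture model can score lower and still produce the more interpretable partition.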
%run User_defined_Functions_EDA_RQ.ipynb
#weather data and seasons data encoded is brought into a new dataframe
fnew_2 = weather_by_day(dt_reduced) ;classifier(fnew_2);print('Show cluster sizes \n', fnew_2.cluster.value_counts());
fnew_2.head(2)
Hour-wise average pickup volumes for weekdays and weekends were calculated at station level and clustered using the K-means method. K=4 was chosen and the cluster means were plotted.
%run User_defined_Functions_EDA_RQ.ipynb
combined_data = bike_clustering(bike);create_clusters(combined_data) ; n_cluster=4; combined_data_clusters,cluster_size = fitted_clusters_and_plots(combined_data,n_cluster) ;
bike_station = visualize_clusters_on_map(combined_data_clusters,bike,cluster_size) ; combined_data.head(n=2)
The cluster centers differed in the magnitude of pickups but not in trends (such as different peak hours across clusters). The biggest cluster (>50% of stations) has the lowest hourly pickups. The smallest cluster, with only 35 stations, has the highest volume, indicating a demand imbalance and a need for overnight bike rebalancing. Predicting pickup volumes (for rebalancing) for the respective clusters may be useful, as the cluster centers are very distinct.
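The station clustering described above can be sketched as K-means over a 48-column feature matrix (24 weekday hours plus 24 weekend hours per station). The data below is synthetic, with four made-up demand levels, and the column names `wd_*` and `we_*` are hypothetical; this is an illustration, not the project's `bike_clustering` or `fitted_clusters_and_plots`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_stations = 200
hours = np.arange(24)

# Shared hourly shapes: two rush-hour peaks on weekdays, one afternoon peak on weekends.
wd_shape = np.exp(-0.5 * ((hours - 8) / 2) ** 2) + np.exp(-0.5 * ((hours - 18) / 2) ** 2)
we_shape = np.exp(-0.5 * ((hours - 15) / 3) ** 2)

# Four hypothetical demand levels; most stations are low volume.
levels = rng.choice([2, 8, 20, 45], size=n_stations, p=[0.55, 0.25, 0.15, 0.05])
weekday = levels[:, None] * wd_shape + rng.normal(0, 0.3, (n_stations, 24))
weekend = 0.6 * levels[:, None] * we_shape + rng.normal(0, 0.3, (n_stations, 24))

# One row per station, 48 hourly-average features.
features = pd.DataFrame(
    np.hstack([weekday, weekend]),
    columns=[f"wd_{h}" for h in hours] + [f"we_{h}" for h in hours],
)

# K-means with K=4, as in the text.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
features["cluster"] = km.labels_

# Cluster sizes and the average hourly pickups within each cluster.
cluster_size = features["cluster"].value_counts()
cluster_mean = features.groupby("cluster").mean().mean(axis=1)
print(pd.DataFrame({"stations": cluster_size, "mean_hourly_pickups": cluster_mean}))
```

Because the stations here share the same hourly shape and differ only in level, the clusters separate on magnitude alone, which mirrors the pattern reported above.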
As a final exploration, the demographics of the 59 neighborhoods of NY and the density of bike stations were analysed to look for opportunities to expand the network.
%run User_defined_Functions_EDA_RQ.ipynb
district_level_metrics, district_level_metrics_2,neigh,neigh_bike, c_d_2 = neighborhood_demographics_import_processed(); c_d_2.to_excel('neighborhoods_demogs_bike_subway_counts.xlsx')
print("Underlying data for plots: sample"); print(c_d_2.head(n=1).T) ; chloropeth_plots(c_d_2); print("Scrollable consolidated map: ")
c_d_2.explore() # Scroll over to look at the various metrics for each neighborhood.
Bike station densities have positive but low correlations with population density, the percentage of car-free commuters and subway counts. Based on the plots, the areas outlined in red could be candidates for network expansion, as they have low median income, subway connectivity, medium population density and high dependence on public transport. (Motivated by this report)
Introduction - Antarlina
Exploratory Analysis - Antarlina
Research Q1: Monthly pickups and weather conditions - Antarlina
Research Q2: Temporal Clustering on pickup volumes - Lucas
Research Q3: Clustering stations on volumes - Antarlina
Research Q4: Bike Stations: Expansion Opportunity - Antarlina
Volume Prediction 1: 8 weeks + 1 week - Lucas (50%) Antarlina (50%)
Volume Prediction 2: 10 months + 2 months - Lucas (70%) Antarlina (30%)
Discussions & Conclusions - Lucas (50%) Antarlina (50%)
Based on the exploration and prediction exercises, it can reasonably be concluded that weather forecasts (RQ:1) might be very useful for predicting short spans (such as Model 1), whereas months and seasons (which summarize the weather conditions) were good predictors for Model 2. The decent accuracy of 0.86 for Model 1 could be improved by adding weather forecasts. The accuracy steadily declines when the model is used to predict other months, as a model cannot predict a behavior it hasn't seen and therefore generalizes poorly.
Model 2, however, being privy to 10 months of data, was better equipped to predict 2 months, but fell short because December has a peculiar holiday behavior that the model hadn't learnt. The accuracy of 0.58 could probably be improved by adding the previous year's data so that the model learns the holiday behavior as well. Further, to improve class balance, stratified sampling could be performed across the months so that the model can predict each behavior equally well.
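One possible reading of the stratified sampling suggested above is an equal-allocation sample, where the same number of rows is drawn from every month. The sketch below uses synthetic records with made-up monthly weights; `stratified_by_month` is a hypothetical helper and was not part of the models reported here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic records: summer months are over-represented (made-up weights).
weights = np.array([3, 3, 5, 7, 10, 12, 12, 13, 12, 11, 7, 5], dtype=float)
weights /= weights.sum()
records = pd.DataFrame({
    "month": rng.choice(np.arange(1, 13), size=6000, p=weights),
    "pickups": rng.poisson(30, size=6000),
})


def stratified_by_month(df, per_month, seed=0):
    """Draw the same number of rows from every month (equal-allocation stratified sample)."""
    return df.groupby("month").sample(n=per_month, random_state=seed)


# Use the size of the rarest month so that no month needs sampling with replacement.
per_month = int(records["month"].value_counts().min())
balanced = stratified_by_month(records, per_month)
print(balanced["month"].value_counts().sort_index())
```

Downsampling to the rarest month discards data from busy months; a proportional split (for example `train_test_split` with its `stratify` argument in scikit-learn) would instead preserve the monthly mix, so the choice depends on whether balance or volume matters more.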
In both scenarios, predicting at station cluster level (RQ:3) and adding the output of the day-level clusters (RQ:2) as a flag would have improved the accuracy. Finally, bike network expansion (RQ:4) could be considered by combining a study of neighborhood demographics with accurate volume predictions.
import os
print("PYTHONPATH:", os.environ.get('PYTHONPATH'))
print("PATH:", os.environ.get('PATH'))
import tensorflow
from tensorflow import keras